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Abstract 

The genetic code has been shown to be very error robust compared to randomly selected codes, but 
to be significantly less error robust than a certain code found by a heuristic algorithm. We formulate 
this optimisation problem as a Quadratic Assignment Problem and thus verify that the code found 
by the heuristic is the global optimum. We also argue that it is strongly misleading to compare the 
genetic code only with codes sampled from the fixed block model, because the real code space is 
orders of magnitude larger. We thus enlarge the space from which random codes can be sampled from 
approximately 2.433 x 10^^ codes to approximately 5.908 x 10''^ codes. We do this by leaving the fixed 
block model, and using the wobble rules to formulate the characteristics acceptable for a genetic code. 
By relaxing more constraints three larger spaces are also constructed. Using a modified error function, 
the genetic code is found to be more error robust compared to a background of randomly generated 
codes with increasing space size. We point out that these results do not necessarily imply that the code 
was optimized during evolution for error minimization, but that other mechanisms could explain this 
error robustness. 

Index Terms 

Genetic code, error robustness, origin of life. 

I. Background 

The genetic code is the set of rules according to which nucleic acid sequences are translated 
into amino acid sequences. Although a few small variations on the standard genetic code are 
known (especially in mitochondrial systems), this set of rules is essentially the same for all 
organisms. The genetic code is therefore one of the most fundamental aspects of biochemistry. 
The pattern of codon assignments in the genetic code appears to be organized in some way (Table 
IJ). First, there is codon similarity for codons encoding the same amino acid. The underlying 
biochemical reason yj is (partly) that tRNA molecules often recognize more than one codon. A 
second phenomenon is that similar amino acids are often specified by similar codons. One way 
to quantify amino acid similarity is to use the values of polar requirement introduced by Woese 
et al. [2J. According to this measure amino acids with a polar side chain like glutamate and 
aspartate have a high value (12.5 and 13.0, respectively), while hydrophobic amino acids like 
leucine and valine have a low value (4.9 and 5.6, respectively). An example of similar codons 
coding for similar amino acids is asparagine, specified by codons AAU and AAC with a polar 
requirement of 10.0 and lysine, specified by AAA and AAG, with a polar requirement of 10.1. 
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TABLE I 

The standard genetic code. Assignment of the 64 possible codons to amino acids or stop signals, with 
polar requirement of the amino acids indicated in brackets. 



UUU Phe (5.0) 
UUC Phe (5.0) 
UUA Leu (4.9) 
UUG Leu (4.9) 


UCU Ser (7.5) 
UCC Ser (7.5) 
UCA Ser (7.5) 
UCG Ser (7.5) 


UAU Tyr (5.4) 
UAC Tyr (5.4) 
UAA STOP 
UAG STOP 


UGU Cys (4.8) 
UGC Cys (4.8) 
UGA STOP 
UGG Trp (5.2) 


CUU Leu (4.9) 
cue Leu (4.9) 
CUA Leu (4.9) 
CUG Leu (4.9) 


ecu Pro (6.6) 
CCC Pro (6.6) 
CCA Pro (6.6) 
CCG Pro (6.6) 


CAU His (8.4) 
CAC His (8.4) 
CAA Gin (8.6) 
CAG Gin (8.6) 


CGU Arg (9.1) 
CGC Arg (9.1) 
CGA Arg (9.1) 
CGG Arg (9.1) 


AUU He (4.9) 
AUG lie (4.9) 
AUA He (4.9) 
AUG Met (5.3) 


ACU Thr (6.6) 
ACC Thr (6.6) 
ACA Thr (6.6) 
ACG Thr (6.6) 


AAU Asn (10.0) 
AAC Asn (10.0) 
AAA Lys (10.1) 
AAG Lys (10.1) 


AGU Ser (7.5) 
AGC Ser (7.5) 
AGA Arg (9.1) 
AGG Arg (9.1) 


GUU Val (5.6) 
GUC Val (5.6) 
GUA Val (5.6) 
GUG Val (5.6) 


GCU Ala (7.0) 
GCC Ala (7.0) 
GCA Ala (7.0) 
GCG Ala (7.0) 


GAU Asp (13.0) 
GAG Asp (13.0) 
GAA Glu (12.5) 
GAG Glu (12.5) 


GGU Gly (7.9) 
GGC Gly (7.9) 
GGA Gly (7.9) 
GGG Gly (7.9) 



Although one may suspect that similar codons code for similar amino acids may also be present 
in a random grouping Haig and Hurst Hi, |151 showed that this is not the case. Random 
codes do not have this property to the same extent as the standard genetic code. 

Haig and Hurst JH generated by computer a large number of alternative genetic codes, in which 
the blocks coding for amino acids in the standard genetic code, e.g. the UCU, UCC, UCA, UCG, 
AGU, AGC block encoding serine, were kept the same, but their assignment to an amino acid 
was randomly redistributed (a procedure generally called "swapping"). We will refer to this as 
the fixed block model |[6l. Note that the use of the word "block" is different from the use in 
studies such as |I3, jHl. We use the word "block" as in [[6l and BUl: in the sense of the collection 
of all codons specifying the same amino acid or chain termination ("STOP" in Table I!]). We will 
call the collection of all codons sharing the same first and second nucleotide "box". The space 
of codes which is created as a result of random code generation under the fixed block model, 
denoted as Space 0, contains exactly 20! 2.433 x 10^^) codes. 



As a measure for the quality of a code the change in polar requirement caused by one step 
point mutations in the codons was proposed. Each codon has nine codons to which it can 
mutate in one step: e.g. for the UCU serine codon, these are UCC, UCA, UCG (these three 
remain coding for serine in the actual code), UUU (coding for phenylalanine, a 2.5 difference 
in polar requirement), UAU (coding for tyrosine, a 2.1 difference), UGU (coding for cysteine, a 
2.7 difference), CCU (coding for proline, a 0.9 difference), ACU (coding for threonine, also a 
0.9 difference), and GCU (coding for alanine, a 0.5 difference). The quality of the code is then 
measured by averaging over all squared differences: MSq. In this calculation, Haig and Hurst 
JH ignored the three "stop codons" which are coding for chain termination. In this way, 263 
connections between adjacent codons contribute equally to MSq. 

To facilitate the mathematical formulation of MSq we introduce an undirected graph G = {V, E) 
that has the 61 codons as its vertices and an edge between any two codons if they differ in only 
one position, yielding 263 edges. Let = (V^ , E^) be the graph obtained by adding the 3 
stop codons to V, yielding 288 edges. A code F maps each codon c to exactly one amino acid 
F(c). We denote by r(F(c)) the polar requirement of the amino acid that codon c encodes for 
w.r.t. code F. The error function of code F is then given by 

MSo{F) = ^ E ir{F{c))-r{F{c'))f. 

{c,c'}eE 

Using MSq as a quality measure of a genetic code Haig and Hurst found that only 1 out 
of 10,000 random codes performs better, i.e. has a lower MSq, than the standard genetic code 
O. This shows that in the standard genetic code not only identical amino acids are encoded 
by similar codons, but also similar amino acids are encoded by similar codons. Originally, Haig 
and Hurst flU, [[5l investigated three other characteristics beside polar requirement (like e.g. the 
isoelectric point), but the correspondence between codon assignments and error robustness with 
respect to polar requirement was most striking. It may be interesting to find other measures which 
perform equally well, or better. However, the measure has to be independent from the genetic 
code (this point has been made in connection with the use of values derived from replacement 
mutations known from sequence data). We have to be careful not to artificially create a measure 
that is based on the genetic code itself. To keep results comparable to the work of Haig and 



Hurst, use of polar requirement is preferable. 



The work of Haig and Hurst was soon followed by the work of Goldman H, who found a 
code using a heuristic method that has a lower MSq value than any of the codes generated 
before. In Section III-AI we verify that Goldman's code is in fact the global optimum in the fixed 
block model. 

Freeland and Hurst [9] presented four histograms to visualise the particular error robustness, 
in the sense of Haig and Hurst [ffl, of the standard genetic code. They reported that with respect 
to the MSq value, 114 codes out of the 1,000,000 random codes had a lower value than the 
standard genetic code. They also reported similar results with respect to the MS measure restricted 
to point mutations in the first, second and third codon, respectively denoted by MSi, MS2 and 
MSs- To define them we partition the edge set E in the graph representation G = {V, E) of the 
adjacency structure of codons, depending on the position in which two adjacent codons differ: 
El is the set of edges between two codons that differ only in the first position, E2 the set 
of edges between two codons that differ only in the second position, and E-^ the set of edges 
between two codons that differ only in the third position. Clearly these sets are disjoint and 
^ = ^1 U ^2 U E3. Then for p = 1, 2, 3, 

MSp{F) = ^ J2 {r{F{c))-r{F{c'))f, 

' {c,c'}eEp 

where |X| denotes the cardinality of X i.e. the number of elements in X. In fact, li^il = 87, 
\E2\ =88 and lE^] = 88. The results of Freeland and Hurst show that there is not much error 
robustness for mutations in the middle position of the codon; the third position, however, is 
extremely robust against changes in polar requirement. 

Subsequent research following this approach has concentrated on nuancing the error function 
||9l, [[Toll . [[TT|. [fT2l|. lfT3l or taking a parameter different from polar requirement as an amino 
acid characteristic [fTOll . [fTTIl . [[Till . [fT3l . The common theme in most of these approaches is the 
code space from which random alternative codes are generated; in ifTTI this space is referred 
to as "possible code space" and we denote this space as Space 0. Remarkably, known genetic 
code variations lie outside Space 0. In code variations certain individual codons are reallocated 
from one block to another. The fixed block structure of the standard genetic code is thereby 
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replaced by an alternative, slightly different, fixed block structure. In Section III-CI we construct 
four progressively larger code spaces (denoted Space 1, Space 2, Space 3 and Space 4), which 
encompass successively more known genetic code variations next to the standard genetic code. 
To be able to compare the genetic code with respect to alternative codes sampled randomly from 
Spaces 1 and 2, we nuance the MS measure such as to accommodate values of polar requirement 
for stop codons. In this paper, we aim at refining several points in the seminal work by Haig 
and Hurst JH, [[Si, Goldman |l6l and Freeland and Hurst ifTTI . Apart from determining the global 
minimum, the refinements concern the code space structure and the kind of conclusions assumed 
to be possible to draw based on the research. We do not intend to change the characteristic taken 
to represent the amino acid (which is polar requirement in the work of Haig and Hurst flU, ^ 
and Goldman ||6|) or to weigh the three positions of the codons differently in the error function 
(as is done in the second part of BU and most subsequent work). We only intend to enlarge the 
space from which random codes are sampled, and find out how they relate to dH. 

II. Results 
A. Goldman 's best solution is the global minimum 

Goldman [[6l applied a heuristic algorithm for finding the best code under the fixed block model. 
The best solution he found had an MSq value of 3.489, which was well below the value of 
5.194 reported by Haig and Hurst flU, Q for the standard genetic code. A heuristic does not 
guarantee that the code found is optimal. We designed an exact method for finding the optimal 
code by formulating the minimization problem as a Quadratic Assignment Problem (QAP) |[T4| 
and solved it using the exact QAP-solver QAPBB [15]. An intuitive formulation of QAP is as 
follows. We are given two sets of objects Vi and V2 of equal size. We are to match each object 
from Vi to exactly one object from V2 such that all objects of V2 are matched as well; as a 
result we get a perfect matching (pairing) of the objects of Vi and V2. In the ordinary (linear) 
assignment problem, there is a cost for assigning object i from Vi to object k from V2 and we 
wish to find the assignment that minimizes total cost. In QAP the cost is dependent on pairs 
of assignments: there is a cost for assigning object i from Vi to object k from V2 and object j 
from Vi to object ^ from V2. Again we wish to minimize the total assignment cost. 

If we consider the set of objects Vi to be the 20 blocks in Table HI and the set of objects V2 
to be the 20 amino acids, then we can model the minimization of MSq by letting the cost of 



assigning one amino acid to one block and another amino acid to another block be given by 
the difference of their polar coordinates times the number of point mutations between the two 
blocks. In Section IV-AI we define this problem formally as a 0-1 integer program with quadratic 
objective and linear constraints. 

QAP is an NP-hard problem, meaning that it is probably hard to solve lfT6l . However, small 
instances of QAP can be solved effectively using an exhaustive enumeration technique known as 
branch and bound This searches (implicitly) through the entire space of solutions, keeping 
note of the best solution found so far, and ignoring parts of the solution space that could not 
possibly lead to a better solution. Even with branch and bound it is in general not feasible to use 
the QAP model for finding a code with minimum MSq value in any reasonable time when we 
leave the fixed block model. However, we could find the global minimum M So value in Space 
0. We found the same solution as Goldman, certifying that his solution was in fact the optimal 
one. 

B. Incorporating stop codons 

Leaving the fixed block model required us to nuance the MS measure and attach a value of 
polar requirement to the stop signal. Chain termination is produced by Release Factors (RFs), 
which are proteins, and therefore most probably later elements of the coding system than tRNAs. 
This is an argument which can also be found in e.g. [fT3l ("... I do not want to assume that there 
were stop codons in the current positions from the beginning, because it is more likely that 
stop codons were a late addition to the code, after the main layout of most of the codons 
was already established"). Genetic codes lacking stop codons are not impossible. During the 
evolutionary development of the genetic code, mRNAs could have been short, and the last sense 
codon of a message could have been the end of the mRNA. After attaching the last amino acid 
of the polypeptide, the primordial ribosome could move further along the mRNA, and both the 
polypeptide and the mRNA could lose the association with the ribosome, as the tape leaving the 
tape recorder in the classical analogy. The more sophisticated mechanism with Release Factors 
could have evolved later, to make things run more smoothly. When this is the scenario of evolution 
of chain termination we follow, we want the stop codons to have the smallest influence on our 
calculations possible. 
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How to assign values to stop codons? 

There are at least four possible ways to deal with the stop codons. In the work described 
in Section IJ the stop codons were ignored and no value was assigned to them. A second way to 
deal with stop codons is to assign a fixed value to a stop codon. A third way would be to assign 
a fixed value to the mutation to a stop codon, which would be the same for all amino acids. 
The last way to deal with the problem would be to mimic the natural process of suppression. 

1) Assigning no value to stop codons: Ignoring stop codons in the calculation as has been 
done until now [HI, [[6]|, BUl is not the way in which their influence is the smallest possible. 
This is because they eliminate a lot of the edges from . For the UCA serine codon, in the 
previous treatment only the edges to UCU, UCC, UCG, UUA, CCA, ACA and GCA take part 
in the calculation. The edges to UAA and UGA are ignored, which means in fact that they 
behave towards serine as if those codons were encoding serine. Due to this effect, the four 
alanine codons have a stronger influence on the calculation than the four glycine codons. Thus 
ignoring stop codons artificially favors certain amino acids. This effect will even become more 
pronounced when we enlarge the space of possible codes. For example, if we allow codes to have 
as many as four stop codons (like our mitochondrial code), or to have stop codons in unusual 
places (like the UUA and UUG stop codons of the mitochondria of Pycnococcus provasolii ifTSl ). 

2) Assigning a fixed value to a node (i.e. give the stop codon a fixed value): If we were to 
reason that a mutation to a stop codon would lead to truncation of messages, we might be inclined 
to attach a very large value to a stop codon (because truncated proteins would be non-functional 
and the mutation therefore lethal). To model "lethal", we could use the value "infinity". This 
makes our calculation useless. We could also attach a polar requirement of 1,000,000 to a stop 
codon. In this case the stop codons are going to dominate the calculation and this is exactly 
what we didn't want to begin with. 

3) Assigning a fixed value to an edge (i.e. give the mutation to a stop codon a fixed value): 
There is another way to model the concept that a mutation to a stop codon is worse than 
a mutation to a sense codon. One could assign a fixed penalty to a mutation to "stop", no 
matter which amino acid is mutated to stop. One relatively large value which could be given 
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as a penalty is the difference in polar requirement between the two most dissimilar amino 
acids. The disadvantage of this approach is again the domination of the calculation by the stop 
mutations. Although less dominating than the very high fixed values suggested for the stop nodes, 
this approach still has the stop codons dominating the calculation, and possibly obscuring the 
phenomenon we want to see. 

4) The suppression approach: What would happen if there is a mRNA with a codon which 
does not have a tRNA? In such a case, one possibility is that decoding is performed by the 
tRNA which, among the tRNA repertoire present in the system in consideration, is the most 
similar to the one which would be needed to decode the codon regularly. This phenomenon 
is called "suppression" in molecular biology lfT9l . In the living cell, the cognate tRNA or RF 
competes with several different potential suppressor tRNAs for decoding a codon [|20ll . By using 
in the calculation the value which would be there in case of the most probable suppression, 
a value is attached to a stop codon which results in a relatively small influence of the stop 
codons in the calculation. The most probable suppression for a stop codon ending on A, is by 
the tRNA which recognizes the sense codon ending on G from the same box. This is reflected 
by genetic code variants: apparently suppressing tRNAs often evolve towards full recognition. 
We can illustrate this with the UGA codon, which can be found in the top right-hand corner of 
Table HI Because the most probable suppression for UGA is by the tRNA which normally reads 
UGG as tryptophan, genetic code variants in which both UGA and UGG encode tryptophan 
evolved multiple times. Although there exists an organism in which UGA is encoding cysteine, 
the more frequent reassignment for UGA is to tryptophan. The same phenomenon is found for 
AUA, which can be found towards the bottom left-hand side of Table HI AUA has been reassigned 
several times to methionine. Suppression of AUA codons in protein coding sequences by the 
tRNA which is normally reading the AUG codons has apparently been followed by the evolution 
of full recognition of the AUA codon by this tRNA. Assigning to a stop codon ending on a purine 
(A or G) the value of polar requirement of the amino acid specified by the other purine-ending 
codon in the box is therefore a possible way to deal with stop codons. This obviously can not be 
done when both purine-ending codons in a box are stop codons. Genetic code variants suggest 
an approach also in this case. In bilaterian mitochondria the tRNA which decodes AGA and 
AGG (recall Table H the AGA and AGG codons can be found towards the bottom right-hand 
side) as arginine in the standard code is not present. The tRNA which decodes AGU and AGC 
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as serine usually takes over the function of decoding AGA and AGG by reading them as serine 
IfSl . This suggests the approach: if in one box both purine-ending codons are stop codons, the 
value of polar requirement of the amino acid specified by the codons ending on a pyrimidine (U 
or C) in that box can be assigned to them. This is always a single amino acid because the two 
pyrimidine-ending codons in the same box always code for the same amino acid. Until now, no 
genetic code variants are discovered with pyrimidine-ending stop codons, so our approach is to 
develop only a way to deal with stop codons ending on purines. 

How to modify the MS measure? 

By treating the stop codons as sense codons according to the suppression approach, we simplified 
the MS measure. In the notation introduced before, 

MSl{F) = ^ J2 {r{F{c))-r{F{c'))f, 

and similarly w.r.t. the three positions p = 1, 2, 3 of the codons 

MS^{F) = ^ E ir{F{c))-r{F{c')))\ 

In this way, all 64 codons contribute equally to the error measure. Note that \E^\ = 288 and 
that \Ef\ = \E^\ = \E^\ = 96. It should be realized that by using MS^ or MS^ we do not 
necessarily start working in a space larger than Space 0. We can use MSq and MSp when we 
generate random codes from Space 1 or Space 2 (see Section III-CI) but we can also use M Sq 
and M Sp when we generate random codes from Space 0. 

We investigate how the new measure reflects the nature of Space when used as a background 
to study the standard genetic code (Table U). We produce four plots as in [[9|. The plots (Figure 
[T]) have the same general shape as the four plots in [9]. In particular, the prominent shoulder at 
the left side is present in both the MS^ (Figure [11(d)) and the MS3 [9] frequency distributions. 
The spikes present in the plots in [SH are not present. They are an artifact of rounding errors 
in both the data and the bin borders of the histograms. The combination of MS values rounded 
to two digits after the decimal point and bin border values which are repetitive binary fractions 
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rounded by the histogram software, are probably the source of the spikes in BH. 

The global minimum code in Space for the M Sq measure was also found using the quadratic 
assignment approach described in Section III-AI We calculated the average of both M Sq and 
MSq of 1,000,000 randomly generated codes as well as the global minimum in Space with 
respect to both measures (Table |IIl). Clearly, both measures give similar results. We also studied 
the proportions of random codes better than the standard genetic code with the MSq measure. 
Out of 1,000,000 random codes 156 codes had a lower MS'q -value than the standard genetic 
code, resulting in a proportion Pq of 0.000156. This was also investigated for p = 1,2 and 3 
(Table Uni). Again the MS and MS^ measures give similar results (as can be detected also from 
the plots of Figure IB- 

We conclude that it is acceptable to replace MS by MS^ to study the character of the standard 
genetic code compared to randomly generated ones. MS^ gives the same results in all essential 
aspects, and can be used to investigate larger spaces and spaces with different codons used as 
chain termination signal. 

TABLE II 

Comparison of MSq and MSq ■ Values were calculated for lO" randomly sampled codes from Space 0. The 

AVERAGES AND VARIANCE ARE SHOWN; MSo IS TAKEN FROM 





MSo 


MSi^ 


Mean of random codes 


9.41 ± 1.51 


9.43 ± 1.89 


Standard genetic code (rounded) 


5.194 


5.501 


Global minimum code (rounded) 


3.489 


3.946 



TABLE III 

Comparison of proportions of "better codes" for MS and MS^ . 



MS 




Po = 0.000114 


Pi = 0.000156 


Pi = 0.002964 


Pf = 0.012369 


P2 = 0.221633 


Pi = 0.129075 


P3 = 0.000088 


Pi = 0.000078 
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C. Enlarging the "possible code space" 

Space has a fixed block structure. It is possible to leave this fixed block structure and generate 
randomly genetic codes, without relaxing all biochemical constraints. In this section we develop 
a method to enlarge the space from which codes are sampled randomly, by specification of 
allowed subdivision of boxes. 

Space does not even cover all existing genetic codes: the only existing genetic code present in 
Space is the standard genetic code. By studying code variants general rules with respect to the 
possible ways to construct a genetic code can be found. Using these rules, we enlarge the code 
space progressively. Genetic code variants are derived from the standard genetic code, as can be 
concluded by studying the codon assignments of close relatives. For mitochondrial code variants 
this is recently described in [[8]|. The number of code variants apart from mitochondria is very 
small and it is nowadays believed that they all are derived from the standard code (although this 
was less clear when the very first variants were discovered). Although these variants probably 
emerged after the standard genetic code, we use the larger spaces because they contain possible 
ways for constructing genetic codes with the system found in living organisms on Earth. 

In the standard genetic code, the box in the top left-hand comer (see Table U) shows one of several 
ways in which a box can be subdivided according to the codon-anticodon pairing patterns allowed 
by the simple wobble rules [[H, [|2T]|. The codons UUU and UUC are assigned to one amino 
acid, and the codons UUA and UUG to another. Recognition of both pyrimidine-ending codons 
by one tRNA molecule is the wobbling behavior of G in the first position of the anticodon as 
proposed by Crick yj. Modification of U (in the first position of the anticodon) to thio-U restricts 
the wobbling behavior of the tRNA molecule to recognition of both purine-ending codons [22], 
ll23l . A second pattern of subdivision is presented by a box towards the bottom left-hand side of 
Table IB In this box AUU, AUG and AUA are assigned to one amino acid and AUG is assigned 
to another. The existence is known of tRNA molecules which recognize all three codons in the 
top of a box [24 J. Recognition of the G-ending codon only, is the wobbling behavior of C in the 
first position of the anticodon as proposed by Crick [JJ. Therefore, this pattern of subdivision 
of a box can be understood by the pairing characteristics of tRNA molecules. In eight boxes of 
Table H] all four codons are assigned to one amino acid, as in the box in the bottom right-hand 
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comer: GGU, GGC, GGA and GGG are assigned to the same amino acid. Recognition of all 
four codons of a box seems to be the wobbling behavior of a tRNA molecule with unmodified 
U in the first anticodon position ll25l . In summary, the wobbling behavior of tRNA molecules 
allows subdivision of boxes with only sense codons in three ways: no subdivision, division 
in a pyrimidine-ending pair and a purine-ending pair, and division in a set of three codons 
in the top of a box, and a single codon at the bottom. Although extensive modifications of 
anticodons in contemporary organisms can lead to much more complex patterns of wobbling 
behavior [|26|. [|27|. [f22|. [|24l. for the purpose of enlarging Space we do not take these 
aspects of the wobble phenomenon in account. These modifications are produced by proteins, 
and therefore were probably not present during the development of the coding system. To allow 
the modifications of U to thio-U (enabling the exclusive recognition of purine-ending codons) 
and A to I (enabling the recognition of three codons by one tRNA molecule) is already pushing 
the limit concerning capacities credibly attributable to a very early living system. 

Further subdivisions of boxes are possible when stop codons are added to the possibilities 
in a box. Because stop codons ending on pyrimidines are not discovered yet, we restrict the 
possibilities to purine-ending stop codons only. This adds four further ways to subdivide a box. 
The upper two codons assigned to an amino acid, and the lower two codons being stop codons 
is the first. The upper three codons assigned to one amino acid, and the bottom codon being a 
stop codon is the second. The upper two codons assigned to one amino acid, the third codon 
being a stop codon, and the last codon assigned to an amino acid, but different from the amino 
acid assigned to the upper two codons, is the third possibility. The last possibility again has 
the third codon being a stop codon, but the three remaining codons are assigned to the same 
amino acid in this case. Taken together with the three possibilities for subdivision with only 
sense codons presented in the previous paragraph, we arrive at seven possible ways to subdivide 
a box according to the simple wobbling behaviour without extensive anticodon modification. 
This is summarized in Table HVl We generate block structures uniformly at random according 
to the rules described in Table ITVl the block structures consist of 21 blocks. 

In our first extension, the "stop block" consists of three stop codons, as in the standard genetic 
code. However, their location is free, under the condition that they do not end in U or C. The 
number of codons allocated to any amino acid is free, as long as each amino acid is encoded by 
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TABLE IV 

Possible types of boxes, a = amino acid, b = amino acid, different from a. s = stop. 





iVlCcUllIlg 


AAAA 


All 4 codons recognized by the same tRNA (or by several tRNAs carrying the same 
dmino diAO-}. 


AAAB 


I cL/UgiiiZiCU vy icL'Ugiiiz.cu uy cunjiiici iivLNrt 

carrying a different kind of amino acid. 


AAAS 


NNU, NNC, NNA recognized by a tRNA, NNG by a Release Factor (RF). 


AABB 


NNU, NNC recognized by one tRNA, NNA, NNG by another tRNA carrying a 

different kind of amino acid. 


AASA 


NNU, NNC recognized by one tRNA, NNA by a RF, NNG by another tRNA, but 
carrying the same amino acid. 


AASB 


NNU, NNC recognized by one tRNA, NNA by a RF, NNG by another tRNA, carrying 

a dillcrciU kind ol' amino acid. 


AASS 


NNU, NNC recognized by one tRNA, NNA, NNG recognized by a RF. 



at least one codon. In this way we obtain a first enlarged space, Space 1, that is more realistic 
than Space 0. Space 1 is, with approximately 5.908 x 10*^^ possible codes, much larger than 
Space (with approximately 2.433 x 10^® codes). 

To include most existing genetic code variations, which differ in the number of stop codons, we 
enlarged Space 1 to Space 2, by allowing the codes to have — 4 stop codons. 

For completeness, we also define two more spaces but we will not use them in our calculations. 
In some bacteria some codons are not used: neither tRNAs nor release factors to recognize them 
(without suppression) are present. To include these code variations too we in addition add a 
new block "unassigned" to our block structure, allowing the number of unassigned codons to 
range between and 40 (Space 3). Every codon is allowed to be unassigned, with the restriction 
that codons ending on U or C are either both assigned or both unassigned. Space 3 contains all 
existing natural genetic code variations. 

Finally (Space 4) we also include codes with fewer or more than 20 amino acids. In many 
speculations on the origin of the genetic code, codes with less than 20 amino acids play a role; 
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Jukes suggested such an evolutionary pathway already in 1966 [[28||. With the extreme of just 
one codon in use, the number of unassigned codons ranges from to 63. The size of Space 
4 is approximately 1.120 x 10^° codes. The sizes of Spaces 0-4 are presented in Table IVl In 
Section EB] we explain the methods behind sampling and counting. The presence of unassigned 
codons in Spaces 3 and 4 causes the function MS^ to be ill-defined. Therefore we could not 
investigate the nature of these spaces, as we will do for Spaces 1 and 2. 



TABLE V 

Sizes and characteristics of the five progressively larger spaces. Number of codes present in Spaces 
— 4. The block structure of Spaces 1 - 4 is free, except for the constraints imposed by adherence to 
THE Wobble Rules, and the specifications listed under "Characteristics of codes". 



Space 


Characteristics of codes 


Approximate size of 
space 


Space 


21 blocks, 20 amino acids, 3 stop codons, unassigned 
codons 


2.433 X 10^* 


Space 1 


21 blocks, 20 amino acids, 3 stop codons, unassigned 
codons, free block structure 


5.908 X 10*^ 


Space 2 


20-21 blocks, 20 amino acids, 0-4 stop codons, 
unassigned codons, free block structure 


1.932 X 10*^ 


Space 3 


20-22 blocks, 20 amino acids, 0-4 stop codons, 0-40 
unassigned codons, free block structure 


8.635 X 10** 


Space 4 


2-34 blocks, 1-32 amino acids, 0-4 stop codons, 0-63 
unassigned codons, free block structure 


1.120 X 10^" 



Figure [2] shows four plots (as in Figure [T]) of MS''^-values, but of codes sampled from Space 
1 rather than Space 0. We notice the great similarity with the plots in Figure [IJ Despite the 
fact that Space 1 is about 2 x 10^'' times larger than Space 0, the mean MS'q -value is still 
about 10. The frequency distributions have the same general nature, and the position of the 
frequency distribution relative to that of the standard genetic code has not changed. We also 
notice that the prominent shoulder at the left side of the M Sf frequency distribution in Figure 
[T]has disappeared. We conjecture that the particular block structure of the standard genetic code 
is responsible for this shoulder. 
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Figure [3] shows the same four plots for Space 2. It is hard to find differences with Figure 
[21 The genetic code seems a bit more special against the background with progressively larger 
spaces: the number of "better codes" found with a million randomly generated codes decreased 
from 156 in Space 0, via 7 in Space 1, to just a single one (Table IVTl) in Space 2. 

TABLE VI 

Basic descriptive statistics of Space 0, Space 1 and Space 2. From each space 10^ codes were randomly 

SAMPLED. 



Measure 


Space 


Space 1 


Space 2 


Mean ± variance 








MSi 


9.426 ± 1.89 


10.663 ±3.13 


10.665 ±3.12 


MSf 


12.100 ±6.37 


12.362 ±5.88 


12.368 ± 5.86 


MSi 


12.627 ±6.33 


12.270 ± 5.79 


12.278 ± 5.79 


MSi 


3.550 ± 2.09 


7.358 ±4.51 


7.348 ± 4.49 


Proportion of better 
codes found 








Pi 


0.000156 


0.000007 


0.000001 


pf 


0.012369 


0.004853 


0.004864 


Pi 


0.129075 


0.151506 


0.150269 


Pi 


0.000078 


0.000000 


0.000000 



in. Discussion 

We now compare five published possible scenarios concerning the evolution of the genetic 
code and show they are not inconsistent with low MS values. 

A. Evolution of the genetic code by selection for error minimization 

The concept that the codon assignments are a feature of living organisms which protects them 
against damage to the genetic information and which is, as such, specifically selected for by 
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natural selection, was first published by Sonnebom [29|. Woese ll30l elaborated on this concept 
by pointing out that it is much more probable that translation errors instead of mutations in the 
genome were the errors against which the system in which the genetic code was developed had to 
be protected. The concept and first experiment of producing computer-generated random codes 
to compare with the genetic code was published by Alff-Steinberger ll3Tll . This author points out 
that the differences found depending on the codon position suggest translation errors rather than 
mutations as responsible for determining (in part!) the structure of the code. Haig and Hurst 
im developed the MS measure and were able to generate much more random codes than Alff- 
Steinberger. They again found differences depending on codon position, but left the possibility 
open, that "... the code acquired its major features before the evolution of proteins" flU, implying 
that selection for protection against errors in protein-coding messages maybe played no role in 
the evolution of the genetic code. Freeland and Hurst Q elaborated on the work of Haig and 
Hurst, and presented the code as "one in a million": "We thus conclude not only that the natural 
genetic code is extremely efficient at minimizing the effects of errors, but also that its structure 
reflects biases in these errors, as might be expected were the code the product of selection" BU. 
The extreme version of the "Error Minimization Hypothesis" would be that all possible codes 
were tested by natural selection, and the standard genetic code was the best. With a measure 
which would be a good model for the errors against which the genetic code was optimized, the 
standard genetic code would then be found to be the global minimum code. There probably are 
no scientists who adhere to such an extreme variant of the "Error Minimization Hypothesis". It 
is, however, tempting to see the low MSq value as an indication that specific selection for error 
minimization was a major determinant of the codon assignments in the standard genetic code 
(e.g. [32]. 



B. The Sequential "2-1-3" Model of Genetic Code Evolution 

Figure [3] shows the main result of dUl remains valid when Space is enlarged to Space 1, 
and subsequently to Space 2: the MS value of the standard genetic code is better than the 
MS value of the average code when point mutations in the second position are considered; 
it is much better when point mutations in the first position are considered; and it is so much 
better when point mutations in the third position are considered that better codes in this respect 
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are not visible in the graphs. This could point to the chronological order in which the codon 
positions acquired coding information. Massey [|33l . ||34l . ||35l published a series of papers in 
which the sequential acquisition of coding information by the second, then the first, and finally 
the third codon position is the major determinant of the codon assignments in the standard 
genetic code. According to this "2-1-3" model, the genetic code started with full degeneracy in 
the side positions. The amino acid repertoire would originally have been limited to four amino 
acids, and coding information was carried by the middle position. Subsequently the amino acid 
repertoire was expanded by assigning coding information to the first position. Because the code 
expansion would be "...facilitated by duplication of the genes encoding adaptor molecules and 
charging enzymes" [34J, amino acids of similar properties would be assigned to codons with 
the same middle nucleotide. Selection on error minimization plays a limited role in the "2-1-3" 
model in so far that code expansion via duplication of adaptor molecules followed by mutation 
of the middle position of the anticodon is selected against. Hence: "... amino acids of similar 
properties were selectively assigned to codons separated from one another by a single potential 
mutation" [[33l . Finally, a further expansion was possible by assigning coding information to 
the third codon position. A consideration of the structure of the tRNA anticodon leads Massey 
to conclude that the third codon position is intrinsically the most error-prone. Therefore it is 
logical that distinguishing codons unambiguously on the third position is only possible when 
protein biochemistry has already progressed beyond the initial stages. Massey states that his 
analyses "...demonstrate that a substantial proportion of error minimization is likely to have 
arisen neutrally, simply as a consequence of code expansion, facilitated by duplication of the 
genes encoding adaptor molecules and charging enzymes. This implies that selection is at best 
only partly responsible for the property of error minimization" [|34ll . The concept of a genetic 
code in which coding information was carried by the middle position only, has been around 
since the sixties (e.g. with Crick: "For example, only the middle base of a triplet may have been 
recognized, a U in that position standing for any of a number of hydrophobic amino acids, an 
A for an acidic one etc." [|3l). The "2-1-3" model, however, goes further than that: it presents 
the chronological order in which the codon positions acquired coding information as the major 
determinant of the error minimization present in the code. The low MSq value is not incompatible 
with the "2-1-3" model; to the contrary, the "2-1-3" model is based on the low MSq value. 



19 



C. The Frozen Accident Theory 

A third scenario is the Frozen Accident Theory of Crick [3]. In this scenario, "... the actual 
allocation of amino acid to codons is mainly accidental and yet related amino acids would be 
expected to have related codons" O. This is because there "...are several reasons why one 
might expect [...] a substitution of one amino acid for another to take place between structurally 
similar amino acids. First, [...] such a resemblance would diminish the bad effects of the initial 
substitution. Second, the new tRNA would probably start as a gene duplication of the existing 
tRNA for those codons. Moreover, the new activating enzyme might well be a modification of 
the existing activating enzyme. This again might be easier if the amino acids were related. Thus, 
the net effect of a whole series of such changes would be that similar amino acids would tend 
to have similar codons, which is just what we observe in the present code" [[31. Please note 
that in text preceding this fragment the possibility has been raised that "... the primitive tRNA 
was its own activating enzyme" [3J, which is a description of a ribozyme avant la lettre. At a 
certain moment the system would reach a stage in which "... more and more proteins would 
be coded and their design would become more sophisticated until eventually one would reach a 
point where no new amino acid could be introduced without disrupting too many proteins. At 
this stage the code would be frozen" Q. Please note that on the very first page of the paper 
the possibility is mentioned that the genetic code is not exactly identical for all organisms, 
although for widely different organisms it had been found to be very similar. Therefore the word 
"frozen" was probably from the start meant to be interpreted with a small degree of flexibility. 
The concept "relatedness" of amino acids is not rigorously defined in the paper, but Crick 
presents three examples of what he considers to be groups of related amino acids. "All codons 
with U in the second place code for hydrophobic amino acids". The polar requirements of this 
specific group of hydrophobic amino acids are 5.0, 4.9, again 4.9, 5.3, and 5.6. A second group 
of "related" amino acids is described in: "The basic and acidic amino acids are all grouped 
near together towards the bottom right-hand side ..." The polar requirements of this group of 
charged (and thus hydrophilic) amino acids are 10.1, 9.1, 13.0, and 12.5. The third example 
is the group of aromatic amino acids: "Phenylalanine, tyrosine and tryptophan all have codons 
starting with U". The polar requirements of these are 5.0, 5.4, and 5.2. Because "related" amino 
acids according to Crick tend to share a similar polar requirement, the low MSq value is not 
incompatible with the "Frozen Accident Theory". A clear difference between the "2-1-3" model 
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and the "Frozen Accident Theory" is the presence of pairs of "related" amino acids with a 
second position difference in the latter: e.g. lysine-arginine, and phenylalanine-tyrosine. In this 
respect, it is relevant to observe that the MS value of the genetic code is lower than the MS 
value of the average code when point mutations in the second position are considered. Both the 
"2-1-3" model and the "Frozen Accident Theory" are scenarios in which the genetic code is 
basically a piece of historical information. Differences between these two scenarios are a lack 
of emphasis on sequential acquisition of coding information for the different codon positions 
in Crick's scenario; and a "refusal" by Crick to have a role for specific selection for error 
minimization in the scenario: "There is no reason to believe, however, that the present code 
is the best possible, and it could have easily reached its present form by a sequence of happy 
accidents. In other words, it may not be the result of trying all possible codes and selecting the 
best. Instead, it may be frozen at a local minimum which it has reached by a rather random 
path" L3J. 



D. The Stereochemical Theory 

A fourth scenario is what Crick named "The Stereochemical Theory" [[3]|. According to this 
scenario there is a physico-chemical relationship between certain nucleic acid triplets and certain 
amino acids. The first such proposal was published by Gamow [36] . Woese spent a lot of effort 
collecting evidence for the support of the Stereochemical Theory [|37l, ll30l , ll38l . JIl, [[39ll . Orgel 
described this scenario as follows: "The simplest theory suggests that the role of tRNA's was 
originally filled by a set of much shorter polynucleotides, perhaps the anticodon trinucleotides 
themselves. In this form, the theory postulates that trinucleotides have a selective affinity for 
the amino acid coded by their complementary trinucleotide. Of course, the selectivity must have 
been limited in the first place, but it is argued that it might have been sufficient to produce 
primitive activating enzymes in the presence of a suitable messenger RNA. Then the system 
could have perfected itself by the "bootstrap" principle, [...]. If this type of theory is correct the 
code is not arbitrary; if life were to start again, certain features of the code would be reproduced 
because the physical interactions on which it is based are unchanging" [40J. Exactly these kind 
of unchanging physical interactions are found in a number of recently published experiments 
(EU, [Il2l, 031 and references therein). Anticodons like GAA, GUA, GUG, and CCA are part 
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of RNA molecules which bind respectively phenylalanine, tyrosine, histidine, and tryptophan. 
Again, phenylalanine and tyrosine form a group of amino acids coded by codons with U in 
the first position (contributing to a low MSq value), but in this scenario the formation of the 
group is due to a straightforward binding affinity of a GAA-containing RNA for phenylalanine, 
and another one of a GUA-containing RNA for tyrosine. Earlier experimental work pointed 
to a stereochemical relationship between the anticodons GCC, AGC and GAC and the simple 
amino acids glycine, alanine and valine respectively ll44l . The same author published models in 
which e.g. asparagine and lysine were shown binding their cognate anticodons p31. If the major 
determinant for the codon assignments in the standard genetic code is stereochemical affinity 
between triplets and amino acids as reported in these publications, this implies a low MSq value. 
Therefore, the Stereochemical Theory is not incompatible with a low MSo value. 



E. A four-column theory for the origin of the genetic code 

The four scenarios discussed above share the characteristic that one factor (either "mini- 
mization", "history" or "stereochemistry") is the major determinant of the codon assignments 
in the standard genetic code. They share this characteristic with the scenarios published by 
Wong H6l and by Ikehara ll47l . Other scenarios are present in which all three factors are major 
determinants ll48l . ||49l . As a last scenario, we discuss the four-column theory published by 
Higgs [fT3]|. Like the scenario proposed by Massey, the earliest genetic code according to the 
four-column theory is encoding a repertoire of four amino acids. Higgs is very detailed on the 
amino acids and the codon assignments in this earliest genetic code: the sixteen codons with U in 
the middle originally encoded valine, the sixteen middle-C codons alanine, the sixteen middle-A 
codons aspartate, and the sixteen middle-G codons glycine. Later amino acids were added to 
this code by a process of subdivision of these 16-codon blocks, in which a subset of the codons 
assigned to an early amino acid were reassigned to a later amino acid. In the four-column theory, 
codons with a certain middle position are reassigned to amino acids similar to the one originally 
assigned to codons with this middle position because this is the least disturbing to already 
existing protein sequences. The driving force for the reassignment is the "positive selection for 
the increased diversity and functionality of the proteins that can be made with a larger amino 
acid alphabet" [fT3l . An intermediate code is presented, with Leu, He and Val coded by middle- 
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U codons, Ser, Pro, Thr and Ala coded by middle-C codons, Asp and Glu coded by middle-A 
codons, and all middle-G codons coding Gly. At this stage, the total of protein-coding sequences 
starts to influence the further development of the code strongly (code-message coevolution, as in 
the series of papers by Sella and Ardell llSOll . [ISTIl . [|52ll . [|53l ) because, as a consequence of their 
function in proteins, glycine codons become rare codons. The consequence of this is that the 
constraint to reassign them is relaxed. The final result is that amino acids which are not similar 
to glycine, but which are associated with strong positive selection because they bring radical new 
functions for proteins (cysteine, tryptophan and arginine) are found coded by middle-G codons. 
Although Higgs emphasizes that the driving force during the process of expansion of the amino 
acid repertoire is not the minimization of translational error, the four-column theory is not as 
"neutral" as the "2-1-3" model, because the "minimal disruption to the proteins already encoded 
by the earlier code" by adding "...later amino acids into positions formerly occupied by amino 
acids with similar properties" is such an important component of the scenario. 

Like the other discussed scenarios, the four-column theory is compatible with a low MSq value. 
All five discussed scenarios agree that error robustness due to codon assignments is present in 
the standard genetic code. The scenarios differ in the way they propose the error robustness has 
been built. 

F. Consequence of the error robustness 

The consequence of the error robustness is an enormous potential to evolve. A variation in 
an RNA sequence can have different kinds of consequences in the protein sequence. At the one 
end of the spectrum, the different codon does not lead to a different amino acid. Slightly more 
effect would be that a different codon would lead to a different amino acid, but this would be 
so similar to the original amino acid that no difference in protein structure is the consequence. 
Most important would be the effect that there is a difference in protein structure, but so small 
that natural selection can use it as a slight step along an evolutionary path. At the far end of the 
spectrum, finally, we find the lethal mutations. Because of this graded intensity of evolutionary 
effect, the nature of the relationship between RNA sequence and protein sequence (i.e. the genetic 
code) gives biochemistry an enormous evolvability ll54l . Il55l . This not necessarily implies that 
the phenomenon itself is built by direct optimizing selection for the error minimizing aspects 
(exactly the same argument holds for the aspects of stop codons allowing additional information 
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to be encoded in protein-coding sequences as described by Itzkovitz and Alon [.56.1 ). 

IV. Conclusions 

Formulating the minimisation problem as a Quadratic Assignment Problem, we certify that 
3.489, Goldman's best solution [|6i is in fact the optimal one. In spite of its theoretical hardness, 
the size of the problem allows for exact solution methods instead of mere heuristics, that may fail 
in finding the optimal solution. We demonstrated that it is possible to sample from much larger 
and more realistic code spaces. Leaving Space 0, and using simple wobble rules we constructed 
four progressively larger code spaces. Their size is of a completely different order than that 
of Space 0. Spaces 3 and 4 contain all existing genetic code variations. Using a modified MS 
measure, the nature of Spaces 1 and 2 could be investigated. In Spaces 1 and 2, the standard 
genetic code was found to be a little more error robust when compared to randomly generated 
codes than it was found to be in Space 0. Finally, limitation of error robustness as a means to 
decide between different evolutionary scenarios is discussed. 

V. Materials and methods 

A. Quadratic Assignment Problem 

We formulate determining the minimum MSq as a Quadratic Assignment Problem. We use 
the graph model presented in the Section U for adjacency of the codon pairs. We number the 
amino acids Ai, . . . , A20 and the blocks in the standard genetic code . . . , i?2o- We introduce 
binary decision variables xik, i = 1, . . . , 20, A; = 1, . . . , 20; xik gets value 1 if amino acid Ai is 
assigned code block and value otherwise. If Xik = 1 and Xji = 1 then this contributes to 
the objective a value 

dikji= Yl (riA,) - riA,)f . 

ceBk,c'eBi,{c,c'}eE 

To find the code with minimum MS'o -value we minimise 

20 20 20 20 

^ ^ ^ ^ dikjgXikXje, 
i=i j=i k=i e=i 

subject to the restrictions 

20 

^ Xjfc = 1, for A; = 1, ... , 20, 

i=l 
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20 

^Xik = 1, for z = 1, . . . ,20, 

k=l 

ensuring that each block encodes some amino acid and that each amino acid is encoded by some 
block, and the restrictions 

Xifc G {0, 1}, for i = 1, . . . , 20, k = l,..., 20, 

ensuring that blocks cannot be assigned fractionally to some amino acid and for another fraction 
to some other amino acid. 

A similar model can be used to compute the code achieving minimum MSq value, although 
it requires time in the order of weeks to compute, as opposed to hours for the MSq value. 
Further extending the above model, to compute the minima of the even larger code spaces, 
leads to programs that even state-of-the-art algorithms cannot solve in any reasonable amount 
of computer time. 

B. Counting and sampling 

TABLE VII 
tRNA induced counts 



Amino acids 


Stop codons 


Unassigned 


Multiplicity 


tRNA Patterns 








4 


1 


"uuuu" 





I 


3 


2 


"uusu" 


"uuus" 





2 


2 


1 


"uuss" 










1 


"aaaa" 







1 


2 


"aaau" 


"aaua" 







2 


2 


"aauu" 


"uuaa" 







3 


1 


"uuua" 




I 





2 


"aaas" 


'aasa" 




1 


1 


2 


"aasu" 


'aaus" 




I 


2 


1 


"uusa" 




2 





1 


"aass" 


2 








2 


"aaab" "aabb" 


2 





1 


1 


"aaub" 


2 


I 





1 


"aasb" 



In Table IVIII we have listed the possible ways to fill a single box that are compatible with the 
considered tRNA wobble rules. Let {pi, . . . ,pm} enumerate the possible tRNA patterns as listed 
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in the rightmost column of Table IVIIl We write a{p), s{p), u{p) for the number of amino acids, 
stop codons and unas signed codons present in pattern p. 

Problem. We now consider the problem of filling 16 boxes (64 codons in total) using 20 different 
amino acids, s stop codons and u unassigned codons. It is useful to solve a slightly more general 
problem: the number of ways to fill h boxes using 

• amino acids, 

• each of the first a amino acids at least once, 

• exactly s stop codons, and 

• exactly u unassigned codons. 

The original problem is obtained by setting a = = 20 and & = 16. 

Recurrence. We denote the number of such fillings by #Ar(6, a, s, u) and compute their values 
by the recurrence 



Rationale. The reasoning behind ([T]) is the following. We fill box number h first, and worry 
about the remaining boxes later. We iterate over the possible tRNA patterns with variable i. To 
realise pattern pi we need a{jpi) amino acids, s{pi) stop codons and u{pi) unassigned codons. 
There is only one way to choose stop codons and unassigned codons, but we can obtain the 
amino acids from two sources. We can take some from the a still-to-use amino acids that we 
have to use at least once, and we must take the others from the N — a free amino acids that can 
be used as desired. We consider all possible ways to realise the choice: We first iterate over the 
number of amino acids that we take from the still-to-use pool with variable j. Selecting j out 
of a still-to-use amino acids can be done in (") ways. Similarly, taking the remaining a{pi) — j 
amino acids from N — a free amino acids can be done in ) ways. All these a{pi) chosen 




1=1 j=0 



with basis 



#jv(0,0,0,0) = l 



#iv(6, a, s, u) = if Ab < a + s + u, 



26 



amino acids are different, and so there are a{pi)\ ways to instantiate the pattern using them. Now 
we still have to fill the remaining 6—1 boxes, using the remaining a — j still-to-use amino acids 
at least once, while using exactly s — s{pi) stop codons and leaving u — u{pi) codons unassigned. 

Implementation. The value #20 (16, 20, s,u) can be efficiently evaluated by dynamic program- 
ming. This is achieved by storing all intermediate values of 7^ that are computed in memory, 
and recalling them when they are needed instead of reevaluating 7^. This way, ^]\f{b, a, s,u) 
can be evaluated in time and space 0{basu). Note that a single call to ^Nib, a, s,u) computes 
#Ar(6', a', s', u') for many b' < b, a' < a, s' < s and u' < u. 

Sampling. The above dynamic programming implementation has the advantage that it allows 
uniform sampling over the space of all codes. We first sample a number uniformly between 1 
and #Ar(6, a, s, u). Then we use the recurrence in reverse to determine which code this number 
corresponds to. This is done as follows. Say the number sampled was n. We then incrementally 
evaluate the sum of ([I])- Once the partial sum up to i surpasses n, we know that pattern pi was 
used in code number n. Similarly we decode which amino acids are used and in which order 
they are placed. By explicitly keeping track of the set of still-to-use amino acids we can retrieve 
the entire code recursively. 
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